Phishing Websites Detector

Simon Weiss - 105366

22/05/2020


1. Introduction

What is Phishing?

As COVID-19 spreads around the world, it is clear that the use of the web and online services is accelerating, confirming the importance of these technologies in our modern world.

In a March letter on emergency preparedness in the context of Covid, the European Central Bank warns that the number of cyberthreats has increased dramatically and stresses the need for “assessing risks of increased cyber security related fraud, aimed both to customers or to the institution via phishing mails, etc.”

One of the most widely recognized online security threats is the phishing attack. The purpose of this fraud is to imitate a real website (for example internet banking, e-commerce, or social networking) so as to acquire confidential data such as usernames, passwords, and financial or health-related information from potential victims.

What is Phishing?


Website Phishing

Phishing sites are crafted to lure users into thinking they are on a legitimate website. The goal of a phishing website is thus to appear as credible as possible, so that it is indistinguishable from the legitimate one.

The coarser phishing websites show a distinctive visual difference from the legitimate site, as in the example of an Amazon login page below. The most successful ones are only recognizable by other characteristics of the page, such as the URL, which will not correspond to the server of the legitimate site.

An Amazon phishing website (a coarse example)


UCI Dataset

It is in this context that we will use the database built by professors Rami Mohammad and T.L. McCluskey of the University of Huddersfield and Fadi Thabtah of the Canadian University of Dubai, published on the well-known UCI Machine Learning Repository.
The database is a collection of URLs for 11,055 websites.
Each sample has 30 website parameters (features) that have proved to be sound and effective in predicting phishing websites, and a Result label (the target) identifying it as a phishing website or not (respectively -1 or 1).


Problem description

Our problem is thus a supervised binary classification problem. We will divide our dataset into a train sample and a test sample, so as to train models on the former and find which model gives the best accuracy score in detecting whether a website is a phishing one or not.


Description of the features in dataset

The features in our database are divided into 4 main groups.

  • Address Bar based Features
  • Abnormal Based Features
  • HTML and JavaScript based Features
  • Domain based Features

For each feature described below, the value was constructed with if/else rules and takes the value 1, -1, or 0. The value 0 means the feature is considered SUSPICIOUS, i.e. it can indicate either a phishy or a legitimate site.

Let us describe the features of the first category, following the feature descriptions provided by the authors of the dataset.

Address Bar based Features

  • Using the IP Address : If an IP address is used instead of the domain name in the URL, the feature takes the value -1, and 1 otherwise.
  • Long URL to Hide the Suspicious Part : Phishers can use a long URL to hide the doubtful part in the address bar. The builders of the dataset computed the length of the URLs and derived thresholds: if the URL is shorter than 54 characters it is classified as legitimate (1), if it is between 54 and 75 characters as suspicious (0), and if it is longer as phishing (-1).
  • Using URL Shortening Services “TinyURL” : if this service is used, the feature takes the value of -1, 1 otherwise.
  • URL’s having “@” Symbol : Using the “@” symbol in the URL leads the browser to ignore everything preceding it, and the real address often follows the “@” symbol. The feature thus takes -1 if there is an “@”, and 1 otherwise.
  • Redirecting using “//” : The existence of “//” within the URL path means that the user will be redirected to another website. The builders of the dataset found that if “//” appears at a position greater than 7 in the URL (i.e. after the “http://” or “https://” prefix), the feature takes the value -1.
  • Adding Prefix or Suffix Separated by (-) to the Domain : The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage.
  • Sub Domain and Multi Sub Domains : If the number of dots is greater than one, the URL is classified as “Suspicious” since it has one sub domain. If there are more than two dots, it is classified as “Phishing” since it has multiple sub domains. Otherwise, if the URL has no sub domains, the feature is assigned “Legitimate”.
  • HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer) : The existence of HTTPS is very important in giving the impression of website legitimacy, but it is clearly not enough. The authors checked the certificate assigned with HTTPS, including how trusted the certificate issuer is and the certificate age. If the issuer is trusted and the certificate is at least one year old, the feature takes the value 1.
  • Domain Registration Length : Based on the fact that a phishing website lives for a short period of time, the authors found that the longest-lived fraudulent domains had been used for only one year.
  • Favicon : A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from a domain other than that shown in the address bar, then the webpage is likely to be considered a Phishing attempt.
  • Using Non-Standard Port : This feature is useful in validating if a particular service (e.g. HTTP) is up or down on a specific server. Several firewalls, Proxy and Network Address Translation (NAT) servers will, by default, block all or most of the ports and only open the ones selected. If all ports are open, phishers can run almost any service they want and as a result, user information is threatened.
  • The Existence of “HTTPS” Token in the Domain Part of the URL : The phishers may add the “HTTPS” token to the domain part of a URL in order to trick users.
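A few of these address-bar rules can be sketched in base R. This is an illustrative re-implementation, not the authors' extraction code: the helper names are our own, and the length thresholds (54 and 75 characters) follow the authors' description above.

```r
# Illustrative encoders for three address-bar rules (1 = legitimate,
# 0 = suspicious, -1 = phishing). Hypothetical helpers, for intuition only.

url_length_feature <- function(url) {
  n <- nchar(url)
  if (n < 54) 1 else if (n <= 75) 0 else -1
}

at_symbol_feature <- function(url) {
  if (grepl("@", url, fixed = TRUE)) -1 else 1
}

double_slash_feature <- function(url) {
  pos <- gregexpr("//", url, fixed = TRUE)[[1]]
  if (any(pos > 7)) -1 else 1   # "//" after the protocol prefix is suspicious
}

url_length_feature("http://short.example")          # 1  (legitimate)
at_symbol_feature("http://evil.com@bank.example")   # -1 (phishing)
double_slash_feature("http://a.example//redirect")  # -1 (phishing)
```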

We hope that, thanks to this part, the reader will better understand the database used here.
In order not to overload this notebook, and since a full comprehension of the variables requires technical explanations, we redirect the reader to the complete description provided by the authors.


Overview of the project

In order to address the problem described in point 1.4, I have implemented 3 main types of classification algorithms (tree classifiers, logistic regression, neural networks). First, we will process the data and do some EDA. Then we will create models and tune their hyperparameters, and finally we will assess and compare the models in order to find the best one. Of all the models developed, the boosted tree model has the highest accuracy, followed by the random forest classifier and logistic regression. So, according to our project, a boosted tree would best predict whether a website is a phishing website or not.

Happy reading!


2. Data processing

Load library

library(RWeka)
library(BCA)
library(car)
library(xgboost)
library(ggplot2)
library(randomForest)
library(DataExplorer)
library(caret)
library(tree)
library(extraTrees)
library(h2o)
library(nnet)
library(corrplot)
library(Hmisc)
library(rpart)
library(plyr)
library(DT)

Load dataset

dataset <-read.arff(url("https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff"))
head(dataset)
##   having_IP_Address URL_Length Shortining_Service having_At_Symbol
## 1                -1          1                  1                1
## 2                 1          1                  1                1
## 3                 1          0                  1                1
## 4                 1          0                  1                1
## 5                 1          0                 -1                1
## 6                -1          0                 -1                1
##   double_slash_redirecting Prefix_Suffix having_Sub_Domain SSLfinal_State
## 1                       -1            -1                -1             -1
## 2                        1            -1                 0              1
## 3                        1            -1                -1             -1
## 4                        1            -1                -1             -1
## 5                        1            -1                 1              1
## 6                       -1            -1                 1              1
##   Domain_registeration_length Favicon port HTTPS_token Request_URL
## 1                          -1       1    1          -1           1
## 2                          -1       1    1          -1           1
## 3                          -1       1    1          -1           1
## 4                           1       1    1          -1          -1
## 5                          -1       1    1           1           1
## 6                          -1       1    1          -1           1
##   URL_of_Anchor Links_in_tags SFH Submitting_to_email Abnormal_URL Redirect
## 1            -1             1  -1                  -1           -1        0
## 2             0            -1  -1                   1            1        0
## 3             0            -1  -1                  -1           -1        0
## 4             0             0  -1                   1            1        0
## 5             0             0  -1                   1            1        0
## 6             0             0  -1                  -1           -1        0
##   on_mouseover RightClick popUpWidnow Iframe age_of_domain DNSRecord
## 1            1          1           1      1            -1        -1
## 2            1          1           1      1            -1        -1
## 3            1          1           1      1             1        -1
## 4            1          1           1      1            -1        -1
## 5           -1          1          -1      1            -1        -1
## 6            1          1           1      1             1         1
##   web_traffic Page_Rank Google_Index Links_pointing_to_page Statistical_report
## 1          -1        -1            1                      1                 -1
## 2           0        -1            1                      1                  1
## 3           1        -1            1                      0                 -1
## 4           1        -1            1                     -1                  1
## 5           0        -1            1                      1                  1
## 6           1        -1            1                     -1                 -1
##   Result
## 1     -1
## 2     -1
## 3     -1
## 4     -1
## 5      1
## 6      1
str(dataset)
## 'data.frame':    11055 obs. of  31 variables:
##  $ having_IP_Address          : Factor w/ 2 levels "-1","1": 1 2 2 2 2 1 2 2 2 2 ...
##  $ URL_Length                 : Factor w/ 3 levels "1","0","-1": 1 1 2 2 2 2 2 2 2 1 ...
##  $ Shortining_Service         : Factor w/ 2 levels "1","-1": 1 1 1 1 2 2 2 1 2 2 ...
##  $ having_At_Symbol           : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ double_slash_redirecting   : Factor w/ 2 levels "-1","1": 1 2 2 2 2 1 2 2 2 2 ...
##  $ Prefix_Suffix              : Factor w/ 2 levels "-1","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ having_Sub_Domain          : Factor w/ 3 levels "-1","0","1": 1 2 1 1 3 3 1 1 3 1 ...
##  $ SSLfinal_State             : Factor w/ 3 levels "-1","1","0": 1 2 1 1 2 2 1 1 2 2 ...
##  $ Domain_registeration_length: Factor w/ 2 levels "-1","1": 1 1 1 2 1 1 2 2 1 1 ...
##  $ Favicon                    : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ port                       : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HTTPS_token                : Factor w/ 2 levels "-1","1": 1 1 1 1 2 1 2 1 1 2 ...
##  $ Request_URL                : Factor w/ 2 levels "1","-1": 1 1 1 2 1 1 2 2 1 1 ...
##  $ URL_of_Anchor              : Factor w/ 3 levels "-1","0","1": 1 2 2 2 2 2 1 2 2 2 ...
##  $ Links_in_tags              : Factor w/ 3 levels "1","-1","0": 1 2 2 3 3 3 3 2 1 1 ...
##  $ SFH                        : Factor w/ 3 levels "-1","1","0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Submitting_to_email        : Factor w/ 2 levels "-1","1": 1 2 1 2 2 1 1 2 2 2 ...
##  $ Abnormal_URL               : Factor w/ 2 levels "-1","1": 1 2 1 2 2 1 1 2 2 2 ...
##  $ Redirect                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ on_mouseover               : Factor w/ 2 levels "1","-1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ RightClick                 : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ popUpWidnow                : Factor w/ 2 levels "1","-1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ Iframe                     : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age_of_domain              : Factor w/ 2 levels "-1","1": 1 1 2 1 1 2 2 1 2 2 ...
##  $ DNSRecord                  : Factor w/ 2 levels "-1","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ web_traffic                : Factor w/ 3 levels "-1","0","1": 1 2 3 3 2 3 1 2 3 2 ...
##  $ Page_Rank                  : Factor w/ 2 levels "-1","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Google_Index               : Factor w/ 2 levels "1","-1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Links_pointing_to_page     : Factor w/ 3 levels "1","0","-1": 1 1 2 3 1 3 2 2 2 2 ...
##  $ Statistical_report         : Factor w/ 2 levels "-1","1": 1 2 1 2 2 1 1 2 2 2 ...
##  $ Result                     : Factor w/ 2 levels "-1","1": 1 1 1 1 2 2 1 1 2 1 ...
datatable(dataset, filter = 'top',options = list())
plot_intro(dataset)

As introduced in 1.5, all our variables are categorical (discrete) variables. Our dataset has 31 columns and 11,055 rows.


Rename column

For the sake of simplicity, we rename our columns, shortening their names and homogenizing their style.

colnames(dataset)
##  [1] "having_IP_Address"           "URL_Length"                 
##  [3] "Shortining_Service"          "having_At_Symbol"           
##  [5] "double_slash_redirecting"    "Prefix_Suffix"              
##  [7] "having_Sub_Domain"           "SSLfinal_State"             
##  [9] "Domain_registeration_length" "Favicon"                    
## [11] "port"                        "HTTPS_token"                
## [13] "Request_URL"                 "URL_of_Anchor"              
## [15] "Links_in_tags"               "SFH"                        
## [17] "Submitting_to_email"         "Abnormal_URL"               
## [19] "Redirect"                    "on_mouseover"               
## [21] "RightClick"                  "popUpWidnow"                
## [23] "Iframe"                      "age_of_domain"              
## [25] "DNSRecord"                   "web_traffic"                
## [27] "Page_Rank"                   "Google_Index"               
## [29] "Links_pointing_to_page"      "Statistical_report"         
## [31] "Result"
cols<-c("HavingIP","LongURL","ShortURL","Symbol","ddRedirecting","PrefixSuffix","SubDomain","HTTPS","DomainRegLen","Favicon","Port","HTTPsToken","RequestURL","AnchorURL", "LinksInTag","SFH","SubEmail","AbnormalURL","Redirect","OnMouseover","RightClick","PopUp","Iframe","AgeOfDomain","DNSRecord","WebTraffic","PageRank","GoogleIndex","LinkToPage","StatsReport","Class")
names(dataset)<-cols
colnames(dataset)
##  [1] "HavingIP"      "LongURL"       "ShortURL"      "Symbol"       
##  [5] "ddRedirecting" "PrefixSuffix"  "SubDomain"     "HTTPS"        
##  [9] "DomainRegLen"  "Favicon"       "Port"          "HTTPsToken"   
## [13] "RequestURL"    "AnchorURL"     "LinksInTag"    "SFH"          
## [17] "SubEmail"      "AbnormalURL"   "Redirect"      "OnMouseover"  
## [21] "RightClick"    "PopUp"         "Iframe"        "AgeOfDomain"  
## [25] "DNSRecord"     "WebTraffic"    "PageRank"      "GoogleIndex"  
## [29] "LinkToPage"    "StatsReport"   "Class"

Let us first check if there is any missing value in our dataset.

Missing Values ?

introduce(dataset)
##    rows columns discrete_columns continuous_columns all_missing_columns
## 1 11055      31               31                  0                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0         11055             342705      1394416

There are no missing values.


Unbalanced dataset ?

table(dataset$Class)
## 
##   -1    1 
## 4898 6157
prop.table(table(dataset$Class))
## 
##        -1         1 
## 0.4430574 0.5569426

The classes are reasonably balanced, so we will not have to implement methods for imbalanced datasets such as SMOTE.


Feature selection ?

We will see in our third part, with our first trees, the importance of the variables in our prediction models. We can already notice that this dataset was intentionally built with features which collectively contribute to deciding whether a website is phishing or not. Thus, feature selection will probably not be necessary.

3. Exploratory Data Analysis

Correlation analysis

corr<-rcorr(as.matrix(dataset))
dataset_coeff = corr$r

corrplot(dataset_coeff, method="square",type="upper", order="hclust", tl.col="black", tl.srt=45)

sort(dataset_coeff[,31],decreasing= TRUE )
##         Class         HTTPS     AnchorURL  PrefixSuffix    WebTraffic 
##  1.000000e+00  7.147412e-01  6.929345e-01  3.486056e-01  3.461031e-01 
##     SubDomain    RequestURL    LinksInTag           SFH   GoogleIndex 
##  2.983233e-01  2.533723e-01  2.482285e-01  2.214190e-01  1.289505e-01 
##   AgeOfDomain      PageRank      HavingIP   StatsReport     DNSRecord 
##  1.214964e-01  1.046449e-01  9.416009e-02  7.985672e-02  7.571775e-02 
##       LongURL        Symbol   OnMouseover          Port    LinkToPage 
##  5.742963e-02  5.294779e-02  4.183844e-02  3.641885e-02  3.257390e-02 
##      SubEmail    RightClick         PopUp       Favicon        Iframe 
##  1.824901e-02  1.265323e-02  8.588679e-05 -2.795247e-04 -3.393524e-03 
##      Redirect ddRedirecting    HTTPsToken   AbnormalURL      ShortURL 
## -2.011346e-02 -3.860761e-02 -3.985390e-02 -6.048764e-02 -6.796589e-02 
##  DomainRegLen 
## -2.257895e-01

We use the first graph and the attached table to identify the variables most correlated with the target.

Although we can notice that some features are highly correlated with each other (>0.5), we choose to keep them for more precision in our model.

We observe that the variables HTTPS and AnchorURL are the most correlated with the target.
Let us plot the distribution of Class for the features most correlated with the target (HTTPS, AnchorURL, PrefixSuffix).


Bar plots

Bar plot for HTTPS by Class

qplot(HTTPS, data=dataset, geom="bar", fill=Class) + 
  theme(legend.position = "top") + 
  theme(axis.text.x=element_text(angle = -20, hjust = 0))

We can see from this graph that the class distribution follows a fairly sound logic: phishing websites fall mainly into the suspicious or phishing categories of the HTTPS/issuer characteristic. However, a small proportion of phishing websites look legitimate according to this feature alone (i.e. there is fortunately work left for our models!). Finally, we can note that all sites with a suspicious HTTPS value (0) were phishing websites (class -1).

Let us analyse what happens for our second most correlated features to target AnchorURL.


Bar plot for AnchorURL by Class

qplot(AnchorURL, data=dataset, geom="bar", fill=Class) + 
  theme(legend.position = "top") + 
  theme(axis.text.x=element_text(angle = -20, hjust = 0))

The distribution of Class mainly follows the same logic as before. The sites considered phishing according to this feature are indeed all phishing, and only a small number of legitimate sites are misclassified by it. The vast majority of sites considered suspicious according to this feature are legitimate sites (unlike for the HTTPS feature).


Bar plots for PrefixSuffix and WebTraffic


qplot(PrefixSuffix, data=dataset, geom="bar", fill=Class) + 
  theme(legend.position = "top")

qplot(WebTraffic, data=dataset, geom="bar", fill=Class) + 
  theme(legend.position = "top")


Bar plot for DomainRegLen

What happens for the feature least correlated with the target variable? Let us draw a bar plot for the DomainRegLen feature.

qplot(DomainRegLen, data=dataset, geom="bar", fill=Class) + 
  theme(legend.position = "top")

As we could have expected, there is a majority of misclassifications according to this feature: class -1 sites take the value 1 in the feature, and vice versa.

Now that we have described the relation between the features and the target variable, let us conclude this EDA part by plotting the distribution of all variables as bar plots, in order to see their global behavior.


Plotting Variable frequency.

plot_bar(dataset)

This ends our EDA part. We can move on to our 4th part: building models.


4. Building models

Prepare final dataset

Now that we have completed our first 3 parts, let us prepare the datasets used to build our models.

Recode Target variable

In parameterized models, negative label values can cause problems. To avoid this, I converted the -1 values to 0.

dataset <- within(dataset, {
  Class <- Recode(Class, '-1=0', as.factor=TRUE)
})

Split data

dataset$Sample <- create.samples(dataset, est = 0.70, val = 0.30, rand.seed = 1)
trainingset<-dataset[dataset$Sample=="Estimation",]
testset<-dataset[dataset$Sample=="Validation",]
trainingset<-trainingset[,-32]
testset<-testset[,-32]
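create.samples() comes from the BCA package. If BCA is unavailable, a 70/30 split can be sketched with base R only; the snippet below uses a hypothetical toy data frame so that it is self-contained, and a fixed seed mirrors rand.seed = 1 above.

```r
# Base-R alternative to BCA::create.samples(): a reproducible 70/30 split.
toy <- data.frame(x = rnorm(1000), Class = factor(sample(c(0, 1), 1000, replace = TRUE)))

set.seed(1)
train_idx <- sample.int(nrow(toy), size = floor(0.70 * nrow(toy)))
trainingset_toy <- toy[train_idx, ]   # 700 rows
testset_toy     <- toy[-train_idx, ]  # 300 rows
```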

Tree models

Tree 1 : Classification Tree

Let us build our first classification tree using the tree package.

tree1.train <-  tree(Class~.,data=trainingset)

summary(tree1.train)
## 
## Classification tree:
## tree(formula = Class ~ ., data = trainingset)
## Variables actually used in tree construction:
## [1] "HTTPS"      "AnchorURL"  "LinksInTag" "WebTraffic"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.3906 = 3019 / 7730 
## Misclassification error rate: 0.09085 = 703 / 7738

Our first tree reveals the importance of 4 variables in predicting our target. HTTPS, AnchorURL and WebTraffic were, as we saw previously, among the variables most correlated with the target class. LinksInTag is also important in the prediction here.

Let us plot our first tree …

plot(tree1.train)
text(tree1.train,pretty = 0)

… and apply our model into test dataset.

tree1.predict<-predict(tree1.train, newdata=testset[,-31], type="class")

Now that we have applied our model, we will print the confusion matrix and simply use the accuracy score to assess model performance.


c1<-confusionMatrix(factor(tree1.predict),factor(testset$Class))
c1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1188   44
##          1  292 1792
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8879, 0.9087)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7916          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8027          
##             Specificity : 0.9760          
##          Pos Pred Value : 0.9643          
##          Neg Pred Value : 0.8599          
##              Prevalence : 0.4463          
##          Detection Rate : 0.3583          
##    Detection Prevalence : 0.3715          
##       Balanced Accuracy : 0.8894          
##                                           
##        'Positive' Class : 0               
## 
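As a sanity check, the accuracy reported above can be recomputed by hand from the four cells of the confusion matrix (positive class: 0):

```r
# Cells as printed above: tp = pred 0 / ref 0, fp = pred 0 / ref 1, etc.
tp <- 1188; fp <- 44; fn <- 292; tn <- 1792
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)
round(accuracy, 4)     # 0.8987, matching the caret output
round(sensitivity, 4)  # 0.8027
```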

Let us use cross-validation to prune the tree optimally. We run K-fold cross-validation with cv.tree to estimate the deviance (or the number of misclassifications) as a function of the cost-complexity parameter k.

tree1.val <-  tree(Class~.,data=trainingset)
cv.val1.tree = cv.tree(tree1.val, FUN = prune.tree,K=10)
plot(cv.val1.tree)

We can observe a big drop in deviance between tree sizes 1 and 2, and the curve flattens out afterwards. We will pick size 4.

tree1_optimal = prune.tree(tree1.train, best=4)
summary(tree1_optimal)
## 
## Classification tree:
## snip.tree(tree = tree1.train, nodes = c(5L, 7L))
## Variables actually used in tree construction:
## [1] "HTTPS"     "AnchorURL"
## Number of terminal nodes:  4 
## Residual mean deviance:  0.5 = 3867 / 7734 
## Misclassification error rate: 0.0924 = 715 / 7738
summary(tree1.train)
## 
## Classification tree:
## tree(formula = Class ~ ., data = trainingset)
## Variables actually used in tree construction:
## [1] "HTTPS"      "AnchorURL"  "LinksInTag" "WebTraffic"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.3906 = 3019 / 7730 
## Misclassification error rate: 0.09085 = 703 / 7738
plot(tree1_optimal)
text(tree1_optimal, pretty=0)


tree1.predict_optimal<-predict(tree1_optimal, newdata=testset[,-31], type="class")
c1<-confusionMatrix(factor(tree1.predict_optimal),factor(testset$Class))
c1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1323  164
##          1  157 1672
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8926, 0.9131)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8042          
##                                           
##  Mcnemar's Test P-Value : 0.7377          
##                                           
##             Sensitivity : 0.8939          
##             Specificity : 0.9107          
##          Pos Pred Value : 0.8897          
##          Neg Pred Value : 0.9142          
##              Prevalence : 0.4463          
##          Detection Rate : 0.3990          
##    Detection Prevalence : 0.4484          
##       Balanced Accuracy : 0.9023          
##                                           
##        'Positive' Class : 0               
## 

We can observe an increase in our accuracy thanks to this pruning!
Let us build a second tree and compare its accuracy score.


Tree 2 : CART tree

Let us use the CART algorithm from the rpart package. For classification, CART chooses splits by minimizing an impurity measure (Gini index or entropy). Here we will use the Gini index and start with a complexity parameter (cp) of 0.
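For a node whose classes occur with proportions p, the Gini index is 1 - sum(p^2): 0 for a pure node, and at most 0.5 for a binary split. A minimal sketch (our own helper, not part of rpart):

```r
# Gini impurity of a node, given the class counts it contains.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 50))  # 0.5   (maximally mixed binary node)
gini(c(100, 0))  # 0     (pure node)
gini(c(75, 25))  # 0.375
```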


tree2.train = rpart(Class~., data=trainingset,cp=0)
plot(tree2.train)
text(tree2.train,pretty=0)

This first plot is quite unreadable.
This can be explained by the fact that with cp = 0 rpart grows the full tree and uses far more variables than the previous one.

tree2.train$cptable
##              CP nsplit  rel error    xerror        xstd
## 1  7.507314e-01      0 1.00000000 1.0000000 0.012780313
## 2  4.008192e-02      1 0.24926858 0.2492686 0.008055952
## 3  1.053248e-02      2 0.20918666 0.2091867 0.007452945
## 4  4.681100e-03      5 0.17758923 0.1854886 0.007058456
## 5  3.218256e-03      8 0.16325336 0.1764190 0.006898731
## 6  2.633119e-03     10 0.15681685 0.1632534 0.006657220
## 7  2.340550e-03     16 0.13633704 0.1524283 0.006449273
## 8  2.047981e-03     17 0.13399649 0.1427736 0.006255939
## 9  1.755413e-03     18 0.13194851 0.1418958 0.006237970
## 10 1.170275e-03     19 0.13019310 0.1375073 0.006147096
## 11 8.777063e-04     22 0.12668227 0.1354593 0.006104085
## 12 6.582797e-04     24 0.12492686 0.1325336 0.006041955
## 13 5.851375e-04     29 0.12053833 0.1296080 0.005978994
## 14 4.388531e-04     45 0.10883558 0.1272674 0.005928009
## 15 3.900917e-04     57 0.10181393 0.1237566 0.005850472
## 16 3.510825e-04     63 0.09947338 0.1240492 0.005856983
## 17 2.925688e-04     68 0.09771796 0.1225863 0.005824337
## 18 1.755413e-04     72 0.09654769 0.1234640 0.005843952
## 19 9.752292e-05     77 0.09566998 0.1228789 0.005830884
## 20 5.851375e-05     83 0.09508484 0.1237566 0.005850472
## 21 0.000000e+00     88 0.09479228 0.1260971 0.005902307

Here we can see the complexity parameter (CP), which decreases with the number of splits (nsplit). The other columns read as follows:

  • rel error : the training error relative to the root node (the first value is 1; every subsequent value is expressed relative to it).
  • xerror : the cross-validated error over multiple train/test folds.
  • xstd : its standard deviation.

Our task here is to pick the cp with the lowest cross-validation error (xerror).

plotcp(tree2.train)

A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.

cp_tree2.train = tree2.train$cptable[which(tree2.train$cptable[,"xerror"]==min(tree2.train$cptable[,"xerror"])),"CP"]
cp_tree2.train
## [1] 0.0002925688
tree2_optimal = prune(tree2.train, cp=cp_tree2.train)

tree2.predict<-predict(tree2_optimal, newdata=testset[,-31], type="class")

c2<-confusionMatrix(factor(tree2.predict),factor(testset$Class))
c2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1246   68
##          1  234 1768
##                                           
##                Accuracy : 0.9089          
##                  95% CI : (0.8986, 0.9185)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8137          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8419          
##             Specificity : 0.9630          
##          Pos Pred Value : 0.9482          
##          Neg Pred Value : 0.8831          
##              Prevalence : 0.4463          
##          Detection Rate : 0.3758          
##    Detection Prevalence : 0.3963          
##       Balanced Accuracy : 0.9024          
##                                           
##        'Positive' Class : 0               
## 

From our second confusion matrix, we can observe a very slight improvement in our accuracy score.


Other Tree models

Tree 3: Random Forest

Random Forest reduces the variance of the forecasts of a single decision tree, thus improving performance. It does this by combining n decision trees in a bagging approach; the individual trees are not pruned.

Each tree in the random forest is trained on a bootstrap sample of the data. The predictions are then aggregated (by majority vote for classification).
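The benefit of voting many learners can be illustrated with a toy simulation. This is not an actual forest: we simply assume B independent classifiers that are each correct with probability 0.7 and take a majority vote. Real random-forest trees are correlated, so the gain is smaller in practice, but the direction is the same.

```r
# Toy illustration of variance reduction by majority vote.
set.seed(42)
B <- 501         # odd number of voters avoids ties
trials <- 2000
votes <- matrix(rbinom(B * trials, size = 1, prob = 0.7), nrow = B)  # 1 = correct
ensemble_correct <- colSums(votes) > B / 2
mean(ensemble_correct)  # close to 1, far above the individual 0.7 rate
```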

tree3.train = randomForest(Class~., data=trainingset, ntree=1000, do.trace=T)
## ntree      OOB      1      2
##     1:   7.15%  7.61%  6.78%
##     2:   6.63%  6.89%  6.42%
##     3:   6.50%  6.63%  6.40%
##     4:   5.80%  6.01%  5.62%
##     5:   5.94%  6.06%  5.85%
##    10:   4.87%  5.49%  4.38%
##    50:   3.50%  4.53%  2.69%
##   100:   3.55%  4.62%  2.71%
##   500:   3.42%  4.56%  2.52%
##  1000:   3.46%  4.59%  2.57%

(Trace abridged: the out-of-bag error drops quickly over the first few dozen trees and stabilises around 3.4-3.5% well before the 1000th tree.)
varImpPlot(tree3.train) 

plot(tree3.train)

Let us apply our model which reduces variance of forecast with averaged predictions from generated subset trees on test dataset.

tree3.predict = predict(tree3.train,newdata=testset[,-31],type="class")
c3<-confusionMatrix(factor(tree3.predict),factor(testset$Class))
c3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1401   37
##          1   79 1799
##                                          
##                Accuracy : 0.965          
##                  95% CI : (0.9582, 0.971)
##     No Information Rate : 0.5537         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.929          
##                                          
##  Mcnemar's Test P-Value : 0.0001408      
##                                          
##             Sensitivity : 0.9466         
##             Specificity : 0.9798         
##          Pos Pred Value : 0.9743         
##          Neg Pred Value : 0.9579         
##              Prevalence : 0.4463         
##          Detection Rate : 0.4225         
##    Detection Prevalence : 0.4337         
##       Balanced Accuracy : 0.9632         
##                                          
##        'Positive' Class : 0              
## 

We obtain an accuracy of 0.96, noticeably better than the previous models! Let us compare it with one final tree model.
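As a sanity check, the headline metrics can be recomputed by hand from the confusion-matrix counts above. Note that caret treats 0 (legitimate) as the 'Positive' class here, as reported in the output; the snippet below is a minimal base-R sketch using those counts.

```r
# Counts from the confusion matrix above (rows = predicted, columns = reference)
tp <- 1401  # predicted 0, actually 0  ('positive' class is 0 in this output)
fn <- 79    # predicted 1, actually 0
fp <- 37    # predicted 0, actually 1
tn <- 1799  # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.9650
sensitivity <- tp / (tp + fn)                   # 0.9466
specificity <- tn / (tn + fp)                   # 0.9798
round(c(accuracy, sensitivity, specificity), 4)
```

These match the values reported by confusionMatrix() above.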


Tree 4 : Boosted Tree

Finally, the last tree considered here is the boosted tree, implemented with XGBoost. XGBoost is a well-known and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning technique that predicts a target variable by combining the estimates of a set of simpler, weaker models. XGBoost minimizes a regularized objective that combines a convex loss function (based on the difference between predicted and target outputs) with L1 and L2 penalty terms on model complexity (the regression tree functions).
Training proceeds iteratively: each new tree predicts the residuals (errors) of the previous trees, and its predictions are added to those of the existing ensemble to form the final prediction.

In other words, at each step we look at which observations the current ensemble predicts poorly and give them more influence when fitting the next tree.
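The fitting-on-residuals idea can be sketched in a few lines of base R. This is a toy illustration under squared-error loss, not the XGBoost algorithm itself: the "weak learner" is a two-leaf stump on a made-up four-point dataset, and `eta` plays the same shrinkage role as in the real model below.

```r
# Toy gradient boosting sketch: repeatedly fit a stump to the residuals
# of the current ensemble and add a shrunken copy of its predictions.
x <- c(1, 2, 3, 4)
y <- c(2, 3, 7, 8)          # made-up targets
pred <- rep(0, 4)           # start from the null model
eta  <- 0.3                 # learning rate (shrinkage)

fit_stump <- function(x, r) {
  best <- NULL; best_sse <- Inf
  for (t in c(1.5, 2.5, 3.5)) {               # candidate split points
    left <- x < t
    p <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - p)^2)
    if (sse < best_sse) { best_sse <- sse; best <- p }
  }
  best                                         # per-point stump predictions
}

for (round in 1:50) {
  r <- y - pred                                # residuals of the ensemble
  pred <- pred + eta * fit_stump(x, r)         # add the shrunken correction
}
round(pred, 2)   # approaches y as boosting rounds accumulate
```

Each round reduces the training error a little, which is exactly the pattern visible in the train-error log printed by xgboost below.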

Let us train our model with a maximum of 1000 boosting iterations and apply it to the test set.


tree4.train = xgboost::xgboost(
  data = data.matrix(trainingset[,-31]),
  label = as.numeric(as.character(trainingset$Class)),
  nrounds = 1000,
  params = list(booster = "gbtree", eta = 0.10, max_depth = 3,
                objective = "binary:logistic",
                subsample = 0.50, colsample_bytree = 0.50))
## [1]  train-error:0.136211 
## [2]  train-error:0.103257 
## [3]  train-error:0.085681 
## [4]  train-error:0.085035 
## [5]  train-error:0.100026 
## ...  (iterations 6-966 omitted; train-error declines steadily from ~0.10 to ~0.023)
## [967]    train-error:0.023391 
## [968]    train-error:0.023391 
## [969]    train-error:0.023262 
## [970]    train-error:0.023133 
## [971]    train-error:0.022874 
## [972]    train-error:0.023003 
## [973]    train-error:0.023262 
## [974]    train-error:0.023003 
## [975]    train-error:0.023003 
## [976]    train-error:0.023003 
## [977]    train-error:0.023003 
## [978]    train-error:0.023133 
## [979]    train-error:0.023133 
## [980]    train-error:0.023262 
## [981]    train-error:0.023133 
## [982]    train-error:0.023262 
## [983]    train-error:0.023520 
## [984]    train-error:0.023262 
## [985]    train-error:0.022874 
## [986]    train-error:0.022745 
## [987]    train-error:0.022745 
## [988]    train-error:0.023133 
## [989]    train-error:0.022874 
## [990]    train-error:0.022874 
## [991]    train-error:0.022874 
## [992]    train-error:0.023133 
## [993]    train-error:0.023133 
## [994]    train-error:0.022616 
## [995]    train-error:0.023003 
## [996]    train-error:0.023003 
## [997]    train-error:0.023133 
## [998]    train-error:0.023133 
## [999]    train-error:0.023133 
## [1000]   train-error:0.023133

# Predict on the training set and round the continuous xgboost output to 0/1 class labels
tree4.predict <- predict(tree4.train, newdata = data.matrix(trainingset[, -31]), type = "class")
tree4.predict <- round(tree4.predict, 0)

c4 <- confusionMatrix(factor(tree4.predict), factor(trainingset$Class))
c4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3316   77
##          1  102 4243
##                                           
##                Accuracy : 0.9769          
##                  95% CI : (0.9733, 0.9801)
##     No Information Rate : 0.5583          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9531          
##                                           
##  Mcnemar's Test P-Value : 0.07284         
##                                           
##             Sensitivity : 0.9702          
##             Specificity : 0.9822          
##          Pos Pred Value : 0.9773          
##          Neg Pred Value : 0.9765          
##              Prevalence : 0.4417          
##          Detection Rate : 0.4285          
##    Detection Prevalence : 0.4385          
##       Balanced Accuracy : 0.9762          
##                                           
##        'Positive' Class : 0               
## 

Our latest model achieves the highest accuracy score so far, above 0.97. Let's see if other model categories can do better!


Regression model

First, let us recall that a logistic regression model uses a linear combination of the predictors:

\[ \eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1} \]

As with ordinary linear regression, we will fit a glm model and perform the usual hypothesis tests, such as the Wald test based on p-values:

\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]
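To make the logistic link concrete, here is a minimal sketch (with made-up coefficients, not the ones fitted below) of how a linear predictor is turned into a probability through the inverse-logit:

```r
# Hypothetical coefficients and feature values, purely for illustration
beta <- c(intercept = -1.5, x1 = 2.0, x2 = 0.8)
x    <- c(1, 1, -1)                 # 1 for the intercept, then two predictors

eta  <- sum(beta * x)               # linear predictor: -1.5 + 2.0*1 + 0.8*(-1) = -0.3
p    <- plogis(eta)                 # inverse logit: 1 / (1 + exp(-eta))
round(p, 3)                         # ~ 0.426
```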

GLM <- glm(Class ~.,family=binomial(logit), data=trainingset)
summary(GLM)
## 
## Call:
## glm(formula = Class ~ ., family = binomial(logit), data = trainingset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2301  -0.0445   0.0000   0.1548   3.1477  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.46835    0.66694  -9.699  < 2e-16 ***
## HavingIP1        1.82176    0.17400  10.470  < 2e-16 ***
## LongURL0        -0.61566    0.56618  -1.087 0.276865    
## LongURL-1       -0.17935    0.21185  -0.847 0.397223    
## ShortURL-1       1.35503    0.37405   3.623 0.000292 ***
## Symbol-1        -0.37081    0.23079  -1.607 0.108124    
## ddRedirecting1   0.40411    0.46408   0.871 0.383871    
## PrefixSuffix1   18.19612  259.35659   0.070 0.944067    
## SubDomain0      -0.01111    0.14386  -0.077 0.938438    
## SubDomain1       1.46232    0.14940   9.788  < 2e-16 ***
## HTTPS1           3.31983    0.13442  24.698  < 2e-16 ***
## HTTPS0          -2.14136    0.38437  -5.571 2.53e-08 ***
## DomainRegLen1   -0.54259    0.15414  -3.520 0.000431 ***
## Favicon-1        0.74898    0.49198   1.522 0.127919    
## Port-1          -0.56953    0.44727  -1.273 0.202888    
## HTTPsToken1     -1.23516    0.31518  -3.919 8.89e-05 ***
## RequestURL-1    -0.31667    0.14517  -2.181 0.029157 *  
## AnchorURL0       5.08621    0.32591  15.606  < 2e-16 ***
## AnchorURL1       7.02454    0.37321  18.822  < 2e-16 ***
## LinksInTag-1    -1.18591    0.16409  -7.227 4.94e-13 ***
## LinksInTag0      0.33349    0.17034   1.958 0.050255 .  
## SFH1             1.10353    0.20767   5.314 1.07e-07 ***
## SFH0             1.40274    0.25842   5.428 5.70e-08 ***
## SubEmail1        0.03552    0.26835   0.132 0.894694    
## AbnormalURL1    -0.65947    0.33995  -1.940 0.052394 .  
## Redirect1       -1.04356    0.24849  -4.200 2.67e-05 ***
## OnMouseover-1   -0.39531    0.37075  -1.066 0.286305    
## RightClick-1    -0.46711    0.47480  -0.984 0.325209    
## PopUp-1          0.32236    0.47623   0.677 0.498463    
## Iframe-1         0.73470    0.43783   1.678 0.093336 .  
## AgeOfDomain1    -0.14691    0.12718  -1.155 0.248044    
## DNSRecord1       1.61293    0.17201   9.377  < 2e-16 ***
## WebTraffic0     -1.73844    0.19543  -8.896  < 2e-16 ***
## WebTraffic1      0.60989    0.17094   3.568 0.000360 ***
## PageRank1        0.11900    0.14563   0.817 0.413842    
## GoogleIndex-1   -1.27195    0.16162  -7.870 3.55e-15 ***
## LinkToPage0     -1.81833    0.16890 -10.766  < 2e-16 ***
## LinkToPage-1    -1.43320    0.28010  -5.117 3.11e-07 ***
## StatsReport1     0.64961    0.23788   2.731 0.006316 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10621.8  on 7737  degrees of freedom
## Residual deviance:  2119.7  on 7699  degrees of freedom
## AIC: 2197.7
## 
## Number of Fisher Scoring iterations: 18
1 - (GLM$deviance/GLM$null.deviance) # McFadden R^2
## [1] 0.800435

The output begins by reporting the distribution of the deviance residuals.
The coefficients table presents the results of our regression analysis. We are particularly interested in columns 1, 2 and 5: the variable name, its regression coefficient, and whether that coefficient is significantly different from zero.

Instead of removing by hand the variables whose p-values are above the 0.05 threshold, we use stepwise variable selection to determine the set of variables that yields the minimum AIC, and compare it with the variables highlighted by the decision tree.

WES.STEP <- step(GLM, direction="both", k=1)
## Start:  AIC=2158.73
## Class ~ HavingIP + LongURL + ShortURL + Symbol + ddRedirecting + 
##     PrefixSuffix + SubDomain + HTTPS + DomainRegLen + Favicon + 
##     Port + HTTPsToken + RequestURL + AnchorURL + LinksInTag + 
##     SFH + SubEmail + AbnormalURL + Redirect + OnMouseover + RightClick + 
##     PopUp + Iframe + AgeOfDomain + DNSRecord + WebTraffic + PageRank + 
##     GoogleIndex + LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - SubEmail       1   2119.8 2157.8
## - PopUp          1   2120.2 2158.2
## - LongURL        2   2121.3 2158.3
## - PageRank       1   2120.4 2158.4
## - ddRedirecting  1   2120.5 2158.5
## - RightClick     1   2120.7 2158.7
## <none>               2119.7 2158.7
## - OnMouseover    1   2120.9 2158.9
## - AgeOfDomain    1   2121.1 2159.1
## - Port           1   2121.4 2159.4
## - Favicon        1   2122.0 2160.0
## - Symbol         1   2122.3 2160.3
## - Iframe         1   2122.6 2160.6
## - AbnormalURL    1   2123.6 2161.6
## - RequestURL     1   2124.5 2162.5
## - StatsReport    1   2127.2 2165.2
## - DomainRegLen   1   2132.2 2170.2
## - ShortURL       1   2134.3 2172.3
## - HTTPsToken     1   2135.8 2173.8
## - Redirect       1   2137.8 2175.8
## - SFH            2   2175.3 2212.3
## - GoogleIndex    1   2182.8 2220.8
## - DNSRecord      1   2212.1 2250.1
## - HavingIP       1   2237.1 2275.1
## - LinkToPage     2   2246.8 2283.8
## - SubDomain      2   2251.5 2288.5
## - LinksInTag     2   2269.6 2306.6
## - PrefixSuffix   1   2310.6 2348.6
## - WebTraffic     2   2411.2 2448.2
## - AnchorURL      2   3035.4 3072.4
## - HTTPS          2   3113.7 3150.7
## 
## Step:  AIC=2157.75
## Class ~ HavingIP + LongURL + ShortURL + Symbol + ddRedirecting + 
##     PrefixSuffix + SubDomain + HTTPS + DomainRegLen + Favicon + 
##     Port + HTTPsToken + RequestURL + AnchorURL + LinksInTag + 
##     SFH + AbnormalURL + Redirect + OnMouseover + RightClick + 
##     PopUp + Iframe + AgeOfDomain + DNSRecord + WebTraffic + PageRank + 
##     GoogleIndex + LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - PopUp          1   2120.2 2157.2
## - LongURL        2   2121.3 2157.3
## - PageRank       1   2120.4 2157.4
## - ddRedirecting  1   2120.5 2157.5
## - RightClick     1   2120.7 2157.7
## <none>               2119.8 2157.8
## - OnMouseover    1   2120.9 2157.9
## - AgeOfDomain    1   2121.1 2158.1
## + SubEmail       1   2119.7 2158.7
## - Favicon        1   2122.0 2159.0
## - Port           1   2122.1 2159.1
## - Symbol         1   2122.3 2159.3
## - Iframe         1   2122.6 2159.6
## - AbnormalURL    1   2123.6 2160.6
## - RequestURL     1   2124.5 2161.5
## - StatsReport    1   2127.2 2164.2
## - DomainRegLen   1   2132.2 2169.2
## - ShortURL       1   2134.3 2171.3
## - HTTPsToken     1   2135.8 2172.8
## - Redirect       1   2137.9 2174.9
## - SFH            2   2175.3 2211.3
## - GoogleIndex    1   2182.9 2219.9
## - DNSRecord      1   2212.2 2249.2
## - HavingIP       1   2237.3 2274.3
## - LinkToPage     2   2247.0 2283.0
## - SubDomain      2   2251.9 2287.9
## - LinksInTag     2   2271.3 2307.3
## - PrefixSuffix   1   2310.7 2347.7
## - WebTraffic     2   2411.3 2447.3
## - AnchorURL      2   3035.8 3071.8
## - HTTPS          2   3114.0 3150.0
## 
## Step:  AIC=2157.22
## Class ~ HavingIP + LongURL + ShortURL + Symbol + ddRedirecting + 
##     PrefixSuffix + SubDomain + HTTPS + DomainRegLen + Favicon + 
##     Port + HTTPsToken + RequestURL + AnchorURL + LinksInTag + 
##     SFH + AbnormalURL + Redirect + OnMouseover + RightClick + 
##     Iframe + AgeOfDomain + DNSRecord + WebTraffic + PageRank + 
##     GoogleIndex + LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - LongURL        2   2121.7 2156.7
## - PageRank       1   2120.9 2156.9
## - OnMouseover    1   2121.0 2157.0
## - ddRedirecting  1   2121.1 2157.1
## - RightClick     1   2121.1 2157.1
## <none>               2120.2 2157.2
## - AgeOfDomain    1   2121.6 2157.6
## + PopUp          1   2119.8 2157.8
## + SubEmail       1   2120.2 2158.2
## - Symbol         1   2122.8 2158.8
## - Port           1   2122.9 2158.9
## - Iframe         1   2123.2 2159.2
## - AbnormalURL    1   2124.5 2160.5
## - RequestURL     1   2125.0 2161.0
## - StatsReport    1   2127.8 2163.8
## - Favicon        1   2131.5 2167.5
## - DomainRegLen   1   2132.8 2168.8
## - ShortURL       1   2134.9 2170.9
## - HTTPsToken     1   2136.0 2172.0
## - Redirect       1   2138.1 2174.1
## - SFH            2   2175.8 2210.8
## - GoogleIndex    1   2183.1 2219.1
## - DNSRecord      1   2212.8 2248.8
## - HavingIP       1   2237.8 2273.8
## - LinkToPage     2   2248.4 2283.4
## - SubDomain      2   2253.1 2288.1
## - LinksInTag     2   2276.2 2311.2
## - PrefixSuffix   1   2311.1 2347.1
## - WebTraffic     2   2411.3 2446.3
## - AnchorURL      2   3036.3 3071.3
## - HTTPS          2   3114.0 3149.0
## 
## Step:  AIC=2156.73
## Class ~ HavingIP + ShortURL + Symbol + ddRedirecting + PrefixSuffix + 
##     SubDomain + HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + 
##     RequestURL + AnchorURL + LinksInTag + SFH + AbnormalURL + 
##     Redirect + OnMouseover + RightClick + Iframe + AgeOfDomain + 
##     DNSRecord + WebTraffic + PageRank + GoogleIndex + LinkToPage + 
##     StatsReport
## 
##                 Df Deviance    AIC
## - ddRedirecting  1   2122.2 2156.2
## - OnMouseover    1   2122.4 2156.4
## - RightClick     1   2122.6 2156.6
## - AgeOfDomain    1   2122.6 2156.6
## <none>               2121.7 2156.7
## - PageRank       1   2122.9 2156.9
## + LongURL        2   2120.2 2157.2
## + PopUp          1   2121.3 2157.3
## + SubEmail       1   2121.7 2157.7
## - Symbol         1   2124.2 2158.2
## - Iframe         1   2124.7 2158.7
## - Port           1   2124.7 2158.7
## - AbnormalURL    1   2125.6 2159.6
## - RequestURL     1   2127.2 2161.2
## - StatsReport    1   2129.4 2163.4
## - Favicon        1   2133.0 2167.0
## - DomainRegLen   1   2134.7 2168.7
## - ShortURL       1   2135.9 2169.9
## - HTTPsToken     1   2137.1 2171.1
## - Redirect       1   2139.6 2173.6
## - GoogleIndex    1   2185.5 2219.5
## - SFH            2   2188.9 2221.9
## - DNSRecord      1   2214.2 2248.2
## - HavingIP       1   2241.3 2275.3
## - LinkToPage     2   2251.2 2284.2
## - SubDomain      2   2253.8 2286.8
## - LinksInTag     2   2279.4 2312.4
## - PrefixSuffix   1   2316.2 2350.2
## - WebTraffic     2   2411.3 2444.3
## - AnchorURL      2   3045.8 3078.8
## - HTTPS          2   3119.6 3152.6
## 
## Step:  AIC=2156.2
## Class ~ HavingIP + ShortURL + Symbol + PrefixSuffix + SubDomain + 
##     HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + RequestURL + 
##     AnchorURL + LinksInTag + SFH + AbnormalURL + Redirect + OnMouseover + 
##     RightClick + Iframe + AgeOfDomain + DNSRecord + WebTraffic + 
##     PageRank + GoogleIndex + LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - OnMouseover    1   2122.8 2155.8
## - RightClick     1   2123.1 2156.1
## - AgeOfDomain    1   2123.1 2156.1
## <none>               2122.2 2156.2
## - PageRank       1   2123.4 2156.4
## + PopUp          1   2121.7 2156.7
## + ddRedirecting  1   2121.7 2156.7
## + LongURL        2   2121.1 2157.1
## + SubEmail       1   2122.2 2157.2
## - Symbol         1   2124.7 2157.7
## - Iframe         1   2125.1 2158.1
## - Port           1   2125.2 2158.2
## - AbnormalURL    1   2125.9 2158.9
## - RequestURL     1   2127.9 2160.9
## - StatsReport    1   2129.7 2162.7
## - Favicon        1   2133.4 2166.4
## - DomainRegLen   1   2135.0 2168.0
## - ShortURL       1   2136.8 2169.8
## - HTTPsToken     1   2137.3 2170.3
## - Redirect       1   2145.0 2178.0
## - SFH            2   2189.4 2221.4
## - GoogleIndex    1   2189.1 2222.1
## - DNSRecord      1   2220.1 2253.1
## - HavingIP       1   2246.8 2279.8
## - LinkToPage     2   2253.4 2285.4
## - SubDomain      2   2253.8 2285.8
## - LinksInTag     2   2279.9 2311.9
## - PrefixSuffix   1   2316.4 2349.4
## - WebTraffic     2   2411.6 2443.6
## - AnchorURL      2   3053.1 3085.1
## - HTTPS          2   3120.6 3152.6
## 
## Step:  AIC=2155.75
## Class ~ HavingIP + ShortURL + Symbol + PrefixSuffix + SubDomain + 
##     HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + RequestURL + 
##     AnchorURL + LinksInTag + SFH + AbnormalURL + Redirect + RightClick + 
##     Iframe + AgeOfDomain + DNSRecord + WebTraffic + PageRank + 
##     GoogleIndex + LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - RightClick     1   2123.5 2155.5
## - AgeOfDomain    1   2123.6 2155.6
## <none>               2122.8 2155.8
## - PageRank       1   2124.0 2156.0
## + OnMouseover    1   2122.2 2156.2
## + ddRedirecting  1   2122.4 2156.4
## + PopUp          1   2122.6 2156.6
## + SubEmail       1   2122.7 2156.7
## + LongURL        2   2121.7 2156.7
## - Iframe         1   2125.1 2157.1
## - Symbol         1   2125.4 2157.4
## - Port           1   2126.0 2158.0
## - AbnormalURL    1   2126.0 2158.0
## - RequestURL     1   2128.6 2160.6
## - StatsReport    1   2130.5 2162.5
## - Favicon        1   2133.8 2165.8
## - DomainRegLen   1   2135.4 2167.4
## - ShortURL       1   2137.4 2169.4
## - HTTPsToken     1   2138.4 2170.4
## - Redirect       1   2145.2 2177.2
## - SFH            2   2190.1 2221.1
## - GoogleIndex    1   2189.7 2221.7
## - DNSRecord      1   2221.4 2253.4
## - HavingIP       1   2248.5 2280.5
## - SubDomain      2   2253.8 2284.8
## - LinkToPage     2   2256.5 2287.5
## - LinksInTag     2   2280.6 2311.6
## - PrefixSuffix   1   2317.6 2349.6
## - WebTraffic     2   2412.1 2443.1
## - AnchorURL      2   3054.0 3085.0
## - HTTPS          2   3124.7 3155.7
## 
## Step:  AIC=2155.52
## Class ~ HavingIP + ShortURL + Symbol + PrefixSuffix + SubDomain + 
##     HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + RequestURL + 
##     AnchorURL + LinksInTag + SFH + AbnormalURL + Redirect + Iframe + 
##     AgeOfDomain + DNSRecord + WebTraffic + PageRank + GoogleIndex + 
##     LinkToPage + StatsReport
## 
##                 Df Deviance    AIC
## - AgeOfDomain    1   2124.3 2155.3
## <none>               2123.5 2155.5
## + RightClick     1   2122.8 2155.8
## - PageRank       1   2124.8 2155.8
## - Iframe         1   2125.1 2156.1
## + OnMouseover    1   2123.1 2156.1
## + ddRedirecting  1   2123.1 2156.1
## + PopUp          1   2123.4 2156.4
## + SubEmail       1   2123.5 2156.5
## + LongURL        2   2122.5 2156.5
## - Symbol         1   2126.1 2157.1
## - Port           1   2126.5 2157.5
## - AbnormalURL    1   2126.7 2157.7
## - RequestURL     1   2129.1 2160.1
## - StatsReport    1   2131.7 2162.7
## - Favicon        1   2134.4 2165.4
## - DomainRegLen   1   2136.6 2167.6
## - ShortURL       1   2138.0 2169.0
## - HTTPsToken     1   2139.6 2170.6
## - Redirect       1   2146.6 2177.6
## - SFH            2   2190.9 2220.9
## - GoogleIndex    1   2190.9 2221.9
## - DNSRecord      1   2222.0 2253.0
## - HavingIP       1   2248.7 2279.7
## - SubDomain      2   2255.1 2285.1
## - LinkToPage     2   2257.7 2287.7
## - LinksInTag     2   2282.0 2312.0
## - PrefixSuffix   1   2317.9 2348.9
## - WebTraffic     2   2413.5 2443.5
## - AnchorURL      2   3055.5 3085.5
## - HTTPS          2   3126.9 3156.9
## 
## Step:  AIC=2155.34
## Class ~ HavingIP + ShortURL + Symbol + PrefixSuffix + SubDomain + 
##     HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + RequestURL + 
##     AnchorURL + LinksInTag + SFH + AbnormalURL + Redirect + Iframe + 
##     DNSRecord + WebTraffic + PageRank + GoogleIndex + LinkToPage + 
##     StatsReport
## 
##                 Df Deviance    AIC
## <none>               2124.3 2155.3
## + AgeOfDomain    1   2123.5 2155.5
## + RightClick     1   2123.6 2155.6
## - Iframe         1   2125.9 2155.9
## + ddRedirecting  1   2123.9 2155.9
## + OnMouseover    1   2124.0 2156.0
## - PageRank       1   2126.1 2156.1
## + PopUp          1   2124.2 2156.2
## + SubEmail       1   2124.3 2156.3
## + LongURL        2   2123.7 2156.7
## - Symbol         1   2126.9 2156.9
## - Port           1   2127.2 2157.2
## - AbnormalURL    1   2127.4 2157.4
## - RequestURL     1   2129.6 2159.6
## - StatsReport    1   2132.4 2162.4
## - Favicon        1   2134.9 2164.9
## - DomainRegLen   1   2137.2 2167.2
## - ShortURL       1   2138.6 2168.6
## - HTTPsToken     1   2140.5 2170.5
## - Redirect       1   2146.9 2176.9
## - SFH            2   2192.1 2221.1
## - GoogleIndex    1   2193.1 2223.1
## - DNSRecord      1   2222.8 2252.8
## - HavingIP       1   2249.2 2279.2
## - LinkToPage     2   2257.7 2286.7
## - SubDomain      2   2260.2 2289.2
## - LinksInTag     2   2284.0 2313.0
## - PrefixSuffix   1   2317.9 2347.9
## - WebTraffic     2   2420.4 2449.4
## - AnchorURL      2   3055.8 3084.8
## - HTTPS          2   3126.9 3155.9
summary(WES.STEP)
## 
## Call:
## glm(formula = Class ~ HavingIP + ShortURL + Symbol + PrefixSuffix + 
##     SubDomain + HTTPS + DomainRegLen + Favicon + Port + HTTPsToken + 
##     RequestURL + AnchorURL + LinksInTag + SFH + AbnormalURL + 
##     Redirect + Iframe + DNSRecord + WebTraffic + PageRank + GoogleIndex + 
##     LinkToPage + StatsReport, family = binomial(logit), data = trainingset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3326  -0.0432   0.0000   0.1536   3.1649  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.48947    0.53035 -12.236  < 2e-16 ***
## HavingIP1       1.84636    0.17138  10.773  < 2e-16 ***
## ShortURL-1      1.20373    0.32761   3.674 0.000239 ***
## Symbol-1       -0.36210    0.22710  -1.594 0.110826    
## PrefixSuffix1  18.18982  259.75289   0.070 0.944172    
## SubDomain0     -0.00312    0.14290  -0.022 0.982583    
## SubDomain1      1.42462    0.14527   9.806  < 2e-16 ***
## HTTPS1          3.30811    0.13317  24.842  < 2e-16 ***
## HTTPS0         -2.09365    0.37455  -5.590 2.27e-08 ***
## DomainRegLen1  -0.54856    0.15369  -3.569 0.000358 ***
## Favicon-1       0.89764    0.28512   3.148 0.001642 ** 
## Port-1         -0.64348    0.38109  -1.689 0.091308 .  
## HTTPsToken1    -1.14285    0.28434  -4.019 5.84e-05 ***
## RequestURL-1   -0.32761    0.14271  -2.296 0.021697 *  
## AnchorURL0      5.08489    0.32369  15.709  < 2e-16 ***
## AnchorURL1      7.00139    0.37002  18.922  < 2e-16 ***
## LinksInTag-1   -1.21232    0.16291  -7.442 9.93e-14 ***
## LinksInTag0     0.32843    0.16956   1.937 0.052759 .  
## SFH1            1.17393    0.19381   6.057 1.39e-09 ***
## SFH0            1.42952    0.25628   5.578 2.43e-08 ***
## AbnormalURL1   -0.54517    0.31358  -1.739 0.082115 .  
## Redirect1      -1.08024    0.23123  -4.672 2.99e-06 ***
## Iframe-1        0.41720    0.33864   1.232 0.217946    
## DNSRecord1      1.63773    0.16982   9.644  < 2e-16 ***
## WebTraffic0    -1.69961    0.19228  -8.839  < 2e-16 ***
## WebTraffic1     0.59798    0.16938   3.530 0.000415 ***
## PageRank1       0.18496    0.13933   1.328 0.184332    
## GoogleIndex-1  -1.30378    0.15897  -8.201 2.38e-16 ***
## LinkToPage0    -1.83713    0.16704 -10.998  < 2e-16 ***
## LinkToPage-1   -1.44360    0.26838  -5.379 7.49e-08 ***
## StatsReport1    0.65986    0.23214   2.842 0.004476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10621.8  on 7737  degrees of freedom
## Residual deviance:  2124.3  on 7707  degrees of freedom
## AIC: 2186.3
## 
## Number of Fisher Scoring iterations: 18
1 - (WES.STEP$deviance/WES.STEP$null.deviance) # McFadden R^2
## [1] 0.800001

We now apply our regression model to the test data.


predict1_reg <- predict(WES.STEP,newdata=testset[,-31],type="response")
head(predict1_reg)
##           1           2           9          20          23          25 
## 0.001517046 0.721022036 0.995673838 0.454112634 0.349650039 0.997354470

The object predict1_reg is a vector that holds the predicted outcomes for the test data. The values are probabilities between 0 and 1 (due to the argument type="response").

Let us now convert these probabilities into binary class labels (0 or 1) using a 0.5 cutoff:

predict1_reg<-ifelse(predict1_reg>0.5, 1, 0)
head(predict1_reg)
##  1  2  9 20 23 25 
##  0  1  1  0  0  1

\[ \hat{C}(x) = \begin{cases} 1 & \hat{p}(x) > 0.5 \\ 0 & \hat{p}(x) \leq 0.5 \end{cases} \]

c5<-confusionMatrix(factor(predict1_reg),factor(testset$Class))
c5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1333   81
##          1  147 1755
##                                           
##                Accuracy : 0.9312          
##                  95% CI : (0.9221, 0.9396)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8603          
##                                           
##  Mcnemar's Test P-Value : 1.672e-05       
##                                           
##             Sensitivity : 0.9007          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9427          
##          Neg Pred Value : 0.9227          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4020          
##    Detection Prevalence : 0.4264          
##       Balanced Accuracy : 0.9283          
##                                           
##        'Positive' Class : 0               
## 

We can see that our regression model achieves good results, with a test-set accuracy of 0.9312, and comes close to the accuracy scores of the tree models.
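As a sanity check, both accuracy figures can be recomputed directly from the confusion-matrix counts reported above:

```r
# Accuracy = (correct 0s + correct 1s) / total, taken from the printed confusion matrices
acc_tree <- (3316 + 4243) / (3316 + 77 + 102 + 4243)   # boosted tree, training set
acc_reg  <- (1333 + 1755) / (1333 + 81 + 147 + 1755)   # logistic regression, test set
round(c(tree = acc_tree, regression = acc_reg), 4)     # 0.9769 vs 0.9312
```

Note that the tree score was computed on the training set while the regression score is on the test set, so the comparison slightly flatters the tree.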


Neural Network

A neural network consists of a collection of highly interconnected elements that transform a set of inputs into a set of desired outputs. The result of the transformation is determined by the characteristics of the elements and by the weights associated with the interconnections among them. A neural network analyses the information and provides a probability estimate that it matches the data it has been trained to recognize. The network learns by being trained on both the inputs and the desired outputs of the problem; its configuration is refined until satisfactory results are obtained, so it gains experience over time as it is trained on data related to the problem.
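The computation performed by a single neuron can be sketched as follows (the weights and inputs below are illustrative, not taken from the trained network):

```r
# One neuron: a weighted sum of the inputs plus a bias, passed through an
# activation function; training adjusts the weights to reduce the error.
sigmoid <- function(z) 1 / (1 + exp(-z))

inputs  <- c(1, -1, 1)          # three hypothetical feature values
weights <- c(0.5, -0.3, 0.8)    # hypothetical connection weights
bias    <- -0.2

output  <- sigmoid(sum(weights * inputs) + bias)
round(output, 3)                # ~ 0.802
```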

Model 6 : Neural network with H2o

We use the well-known R package h2o to build a neural network model.

Let us first initialise the H2O environment and convert our datasets into H2O frames.

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 21 hours 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.2 
##     H2O cluster version age:    1 month and 6 days  
##     H2O cluster name:           H2O_started_from_R_swp_amr953 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.53 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.3 (2020-02-29)
h2o.train <- as.h2o(trainingset)
h2o.test <- as.h2o(testset)

Let us now build our model with the parameters described below.


h2o.model <- h2o.deeplearning(x = setdiff(names(trainingset), c("Class")),
                              y = "Class",
                              training_frame = h2o.train,
                              standardize = TRUE,        # standardize the data
                              hidden = c(100, 100, 100), # 3 hidden layers of 100 nodes each
                              rate = 0.01,               # learning rate
                              epochs = 1000,             # passes over the data (deliberately high so the model can compete with the tree classifiers)
                              seed = 1234                # reproducibility seed
                              )
## Warning in .h2o.processResponseWarnings(res): rate cannot be specified if adaptive_rate is enabled..

We then apply the model to our test dataset.

h2o.prediction <- as.data.frame(h2o.predict(h2o.model, h2o.test))
## 
  |======================================================================| 100%


c6<-confusionMatrix(factor(h2o.prediction$predict),factor(testset$Class))
c6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1401   52
##          1   79 1784
##                                           
##                Accuracy : 0.9605          
##                  95% CI : (0.9533, 0.9669)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9199          
##                                           
##  Mcnemar's Test P-Value : 0.02311         
##                                           
##             Sensitivity : 0.9466          
##             Specificity : 0.9717          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9576          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4225          
##    Detection Prevalence : 0.4382          
##       Balanced Accuracy : 0.9591          
##                                           
##        'Positive' Class : 0               
## 

Model 7 : Neural Network with nnet


model_nnet<-nnet(Class ~. , data=trainingset,size=10, maxit = 500)
## # weights:  401
## initial  value 7229.216008 
## iter  10 value 1378.974336
## iter  20 value 1156.017874
## iter  30 value 1072.220165
## iter  40 value 908.194273
## iter  50 value 779.870238
## iter  60 value 688.362668
## iter  70 value 598.045200
## iter  80 value 517.569671
## iter  90 value 457.772468
## iter 100 value 422.330390
## iter 110 value 402.950924
## iter 120 value 389.435807
## iter 130 value 378.942650
## iter 140 value 370.599448
## iter 150 value 364.212302
## iter 160 value 362.498818
## iter 170 value 360.472550
## iter 180 value 357.991926
## iter 190 value 355.873394
## iter 200 value 355.194006
## iter 210 value 354.455385
## iter 220 value 353.757369
## iter 230 value 352.406584
## iter 240 value 351.783645
## iter 250 value 351.207384
## iter 260 value 350.832206
## iter 270 value 350.633662
## iter 280 value 350.337568
## iter 290 value 350.035898
## iter 300 value 349.689900
## iter 310 value 349.259129
## iter 320 value 348.641011
## iter 330 value 348.568714
## iter 340 value 348.167418
## iter 350 value 347.882373
## iter 360 value 347.708100
## iter 370 value 347.441220
## iter 380 value 347.261646
## iter 390 value 347.010012
## iter 400 value 346.738320
## iter 410 value 346.468868
## iter 420 value 346.046006
## iter 430 value 345.932199
## iter 440 value 345.669345
## iter 450 value 345.463716
## iter 460 value 345.355821
## iter 470 value 344.933877
## iter 480 value 344.610040
## iter 490 value 344.383182
## iter 500 value 344.126434
## final  value 344.126434 
## stopped after 500 iterations
pred_nnet <- predict(model_nnet, testset[,-31],type = "class")

c7<-confusionMatrix(factor(pred_nnet),factor(testset$Class))
c7
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1390   70
##          1   90 1766
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9439, 0.9588)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9022          
##                                           
##  Mcnemar's Test P-Value : 0.1331          
##                                           
##             Sensitivity : 0.9392          
##             Specificity : 0.9619          
##          Pos Pred Value : 0.9521          
##          Neg Pred Value : 0.9515          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4192          
##    Detection Prevalence : 0.4403          
##       Balanced Accuracy : 0.9505          
##                                           
##        'Positive' Class : 0               
## 

5. Model assessment

Let us display all our confusion matrices in order to choose the best model.

c1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1323  164
##          1  157 1672
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8926, 0.9131)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8042          
##                                           
##  Mcnemar's Test P-Value : 0.7377          
##                                           
##             Sensitivity : 0.8939          
##             Specificity : 0.9107          
##          Pos Pred Value : 0.8897          
##          Neg Pred Value : 0.9142          
##              Prevalence : 0.4463          
##          Detection Rate : 0.3990          
##    Detection Prevalence : 0.4484          
##       Balanced Accuracy : 0.9023          
##                                           
##        'Positive' Class : 0               
## 
c2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1246   68
##          1  234 1768
##                                           
##                Accuracy : 0.9089          
##                  95% CI : (0.8986, 0.9185)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8137          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8419          
##             Specificity : 0.9630          
##          Pos Pred Value : 0.9482          
##          Neg Pred Value : 0.8831          
##              Prevalence : 0.4463          
##          Detection Rate : 0.3758          
##    Detection Prevalence : 0.3963          
##       Balanced Accuracy : 0.9024          
##                                           
##        'Positive' Class : 0               
## 
c3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1401   37
##          1   79 1799
##                                          
##                Accuracy : 0.965          
##                  95% CI : (0.9582, 0.971)
##     No Information Rate : 0.5537         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.929          
##                                          
##  Mcnemar's Test P-Value : 0.0001408      
##                                          
##             Sensitivity : 0.9466         
##             Specificity : 0.9798         
##          Pos Pred Value : 0.9743         
##          Neg Pred Value : 0.9579         
##              Prevalence : 0.4463         
##          Detection Rate : 0.4225         
##    Detection Prevalence : 0.4337         
##       Balanced Accuracy : 0.9632         
##                                          
##        'Positive' Class : 0              
## 
c4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3316   77
##          1  102 4243
##                                           
##                Accuracy : 0.9769          
##                  95% CI : (0.9733, 0.9801)
##     No Information Rate : 0.5583          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9531          
##                                           
##  Mcnemar's Test P-Value : 0.07284         
##                                           
##             Sensitivity : 0.9702          
##             Specificity : 0.9822          
##          Pos Pred Value : 0.9773          
##          Neg Pred Value : 0.9765          
##              Prevalence : 0.4417          
##          Detection Rate : 0.4285          
##    Detection Prevalence : 0.4385          
##       Balanced Accuracy : 0.9762          
##                                           
##        'Positive' Class : 0               
## 
c5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1333   81
##          1  147 1755
##                                           
##                Accuracy : 0.9312          
##                  95% CI : (0.9221, 0.9396)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8603          
##                                           
##  Mcnemar's Test P-Value : 1.672e-05       
##                                           
##             Sensitivity : 0.9007          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9427          
##          Neg Pred Value : 0.9227          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4020          
##    Detection Prevalence : 0.4264          
##       Balanced Accuracy : 0.9283          
##                                           
##        'Positive' Class : 0               
## 
c6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1401   52
##          1   79 1784
##                                           
##                Accuracy : 0.9605          
##                  95% CI : (0.9533, 0.9669)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9199          
##                                           
##  Mcnemar's Test P-Value : 0.02311         
##                                           
##             Sensitivity : 0.9466          
##             Specificity : 0.9717          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9576          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4225          
##    Detection Prevalence : 0.4382          
##       Balanced Accuracy : 0.9591          
##                                           
##        'Positive' Class : 0               
## 
c7
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1390   70
##          1   90 1766
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9439, 0.9588)
##     No Information Rate : 0.5537          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9022          
##                                           
##  Mcnemar's Test P-Value : 0.1331          
##                                           
##             Sensitivity : 0.9392          
##             Specificity : 0.9619          
##          Pos Pred Value : 0.9521          
##          Neg Pred Value : 0.9515          
##              Prevalence : 0.4463          
##          Detection Rate : 0.4192          
##    Detection Prevalence : 0.4403          
##       Balanced Accuracy : 0.9505          
##                                           
##        'Positive' Class : 0               
## 

Since our dataset is balanced, we chose a simple metric to evaluate our models: rather than using ROC curves or AIC, we simply keep the model with the best accuracy. From the confusion matrices above, we can rank our models by performance in the following order: first comes the Boosted Tree, then the Random Forest and the Neural Network models. The regression model provided better accuracy than our first two classification trees.
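Since only overall accuracy is compared, the ranking can be tabulated in a couple of lines of R; a minimal sketch, with the accuracies copied from the outputs above:

```r
# Accuracies copied from the confusion matrices above (c1..c7); in a live
# session each value can also be pulled directly, e.g. c4$overall["Accuracy"]
acc <- c(c1 = 0.9032, c2 = 0.9089, c3 = 0.9650, c4 = 0.9769,
         c5 = 0.9312, c6 = 0.9605, c7 = 0.9517)
sort(acc, decreasing = TRUE)
# c4 (0.9769) ranks first, then c3, c6, c7, c5, c2, c1
```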


6. Summary and Future Work

Although the performance of the seven machine learning methods we used is quite comparable, we found that the Boosted Trees model achieved the best results. We also found that a simple regression method can sometimes outperform certain types of trees in classification, and that neural networks achieve competitive results. Our results demonstrate the potential of machine learning for detecting and classifying phishing websites.

One interesting future development would be to build an online Phishing Website Detector as a web application around these models with R Shiny.

Here is an example of such a website, together with some screenshots of it: https://malicious-url-detectorv5.herokuapp.com/

Homepage - An Example of a Phishing Detector Web-Application

Malicious Website - An Example of a Phishing Detector Web-Application

Building such an application would be very interesting because it would make it possible to use the models in practical cases and to validate our conclusions about model performance with new input data. I began doing some research in order to develop such a website, and I propose here the major steps I identified to implement it.

  • Within an R Shiny script:

  • Use the web-scraping R package “rvest”.

  • Extract all the information needed for our 30 features.

  • Build if-else functions based on the rules used to build the dataset, like this Python script: https://github.com/srimani-programmer/Phishing-URL-Detector/blob/master/feature_extraction.py

  • https://phishtank.com/index.php provides URLs of detected phishing websites to feed our models.

  • Once the feature set is rebuilt, use our models to predict whether the input URL is phishing or not and print the accuracy score.

  • Show the results on an R Shiny server web page.
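The URL-based part of the feature extraction above can be sketched in plain R. The sketch below is a minimal illustration covering only 3 of the 30 features, with encodings and thresholds following the rules described in the UCI dataset documentation; a full implementation would also need rvest to scrape page content:

```r
# Illustrative extraction of a few of the 30 UCI features from a raw URL.
# Encoding follows the dataset convention: 1 = legitimate, 0 = suspicious, -1 = phishing.
extract_features <- function(url) {
  # Strip the scheme and path to isolate the host part
  host <- sub("^[a-z]+://", "", url)
  host <- sub("/.*$", "", host)

  # having_IP_Address: -1 if the host is a raw IPv4 address
  has_ip <- grepl("^\\d{1,3}(\\.\\d{1,3}){3}$", host)

  # URL_Length: < 54 characters legitimate, 54-75 suspicious, longer phishing
  n <- nchar(url)
  url_length <- if (n < 54) 1 else if (n <= 75) 0 else -1

  # having_At_Symbol: -1 if "@" appears anywhere in the URL
  has_at <- grepl("@", url, fixed = TRUE)

  data.frame(
    having_IP_Address = if (has_ip) -1 else 1,
    URL_Length        = url_length,
    having_At_Symbol  = if (has_at) -1 else 1
  )
}

extract_features("http://125.98.3.123/paypal.login/secure")
# → having_IP_Address = -1, URL_Length = 1, having_At_Symbol = 1
```

Each such rule mirrors one column of the dataset, so the resulting data frame can be fed directly to the trained models.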


Acknowledgements and references

The following is a list of helpful contributors and references.

  • Dr Jaroslaw Jozef Olejniczak - Big Data - SGH Spring 2020
  • Professor Lukasz Krainski - Statistical Learning Methods [223490-0286] - Summer semester 2019/20
  • Github https://github.com/srimani-programmer/Phishing-URL-Detector/blob/master/feature_extraction.py
  • Another example using a web scraper https://medium.com/swlh/supervised-learning-to-detect-phishing-urls-d0779d360dc8
  • Using project framework of https://medium.com/intel-student-ambassadors/using-ai-for-managing-renewable-energy-generation-and-management-c7be86cde760
  • Stack Overflow community

Thank you for reading!